(54) Method of using primary and secondary processors

(57) The invention relates to the compilation of source code to a primary and a secondary processor. It relates to reconfigurable secondary processors, and is especially relevant to secondary processors which can be reconfigured to some degree during execution of code. Selective extraction of dataflows from the source code is followed by transformation of the extracted dataflows into trees. The trees are then matched against each other to determine minimum edit cost relationships for transformation of one tree into another, where these minimum edit cost relationships are determined by the architecture of the secondary processor. A group or a plurality of groups of dataflows is determined on the basis of said minimum edit cost relationships, and for each group a generic dataflow capable of supporting each dataflow in that group is created. The generic dataflow or dataflows is then used to determine the hardware configuration of the secondary processor, and calls to the secondary processor for said group or plurality of groups of dataflows are substituted into the source code. The resultant source code is compiled to the primary processor.

The resulting efficient configuration thus reduces either the expense of reconfiguration (in a field programmable array), or the silicon area (in an application specific integrated circuit).

Description

[0001] The present invention relates to the compilation and execution of source code for a processor architecture consisting of a primary processor and one (or more) secondary processors. The invention is particularly, though not exclusively, relevant to architectures employing a reconfigurable secondary processor.

[0002] A primary processor - such as a Pentium processor in a conventional PC (Pentium is a Trade Mark of Intel Corporation) - has evolved to be versatile, in that it is adapted to handle a wide range of computational tasks without being optimised for any of them. Such a processor is thus not optimised to handle efficiently computationally intensive operations, such as parallel sub-word tasks. Such tasks can cause significant bottlenecks in the execution of code.

[0003] An approach taken to solve this problem is the development of integrated circuits specifically adapted for particular applications. These are known as ASICs, or application-specific integrated circuits. Tasks for which such an ASIC is adapted are generally performed very well: however, the ASIC will generally perform poorly, if at all, on tasks for which it is not configured. Clearly, a specific IC can be built for a particular application, but this is not a desirable solution for applications that are not central to the operation of a computer, or are not yet determined at the time of building the computer. It is thus particularly advantageous for an ASIC to be reconfigurable, so that it can be optimised for different applications as required. The commonest form of architecture for such devices is the field programmable gate array (FPGA), a fine-grained processor structure which can be configured to have a structure which is suited to any given application. Such structures can be used as independent processors in suitable contexts, but are also particularly appropriate to use as coprocessors.
[0004] Such configurable coprocessors have the potential to improve the performance of the primary processor. For particular tasks, code run inefficiently by the primary processor can be extracted and run more efficiently in an adapted coprocessor which has been optimised for that application. With continued development of such "application-specific" secondary processors, the possibility of improving performance by extracting difficult code to a custom coprocessor becomes more attractive. A particularly important example in general computing is the extraction of loop bodies in image handling.

[0005] To obtain the desired efficiency gains, it is necessary to determine as effectively as possible how code is to be divided between primary and secondary processors, and to configure the secondary processor for optimal execution of its assigned part of the code. One approach is to mark the code appropriately on its creation for mapping to coprocessor structures. In "A C++ compiler for FPGA custom execution units synthesis", Christian Iseli and Eduardo Sanchez, IEEE Symposium on FPGAs for Custom Computing Machines, Napa, California, April 1995, an approach is employed which involves mapping of C++ to FPGAs in VLIW (Very Long Instruction Word) structures after appropriate tagging of the initial code by the programmer. This approach relies on the initial programmer making a good choice of code to extract initially.

[0006] An alternative approach is to assess the initial code to determine which will be the most appropriate elements to direct to the secondary processor. "Two-Level Hardware/Software Partitioning Using CoDe-X", Reiner W. Hartenstein, Jürgen Becker and Rainer Kress, in Int. IEEE Symposium on Engineering of Computer Based Systems (ECBS), Friedrichshafen, Germany, March 1996, discusses a codesign tool which incorporates a profiler to assess which parts of an initial code are suitable for allocation to a coprocessor and which should be reserved for the primary processor. This is followed by an iterative procedure allowing for compilation of a subset of C code to a reconfigurable coprocessor architecture so that the extracted code can be mapped to the coprocessor. This approach does expand the usage of secondary processors, but does not fully realize the potential of reconfigurable logic.

[0007] Comparable approaches have been proposed in the BRASS research project at the University of Berkeley. An approach described in "[...] Placement", Tim Callahan & John Wawrzynek, a poster presented at FCCM'97, Symposium on Field-Programmable Custom Computing Machines, April 16-18 1997, Napa, California (available on the World Wide Web at http://www.cs.berkeley.edu/projects/brass/[...]c_fccm_poster_thumb.ps), uses template structures representative of an FPGA architecture to guide the mapping of source code on to FPGA structures. Source code samples are rendered as directed acyclic graphs (DAGs), which are then reduced to trees. These concepts are set out, for example, in "High Performance Compilers for Parallel Computing", Michael Wolfe, pages 49 to 56, Addison-Wesley, Redwood City, 1996, but a brief definition of a DAG and a tree follows here.

[0008] A graph consists of a set of nodes, and a set of edges: each edge is defined by a pair of nodes (and can be considered graphically as a line joining those nodes). A graph can be either directed or undirected: in a directed graph each edge has a direction. If it is possible to define a path in the graph from a node back to itself, then the graph is cyclic; if not, it is acyclic. A DAG is a graph that is both directed and acyclic: it is thus a hierarchical structure. A tree is a specific kind of DAG: it has a single source node (the root), and there is a unique path from the root to every other node. A node can have one or more "child nodes", but a child node can have only one parent, whereas in a general DAG a child can have more than one parent. Nodes of a tree with no children are termed leaf nodes.
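
The distinction between a general DAG and a tree can be sketched in a few lines of Python. This is only an illustration of the definitions above: the function name `is_tree` and the representation of a graph as a list of (parent, child) edges are assumptions made here, and the input is assumed to be acyclic already.

```python
from collections import Counter

def is_tree(edges):
    """Return True if the acyclic graph given as (parent, child) edges is a tree.

    Assumes the input is already a DAG (no cycles). A tree then needs exactly
    one node without a parent (the root) and exactly one parent for every
    other node.
    """
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    parent_count = Counter(child for _, child in edges)
    roots = [n for n in nodes if parent_count[n] == 0]
    return len(roots) == 1 and all(
        parent_count[n] == 1 for n in nodes if n != roots[0]
    )

# A "diamond": node d has two parents (b and c), so this DAG is not a tree.
diamond = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
# Removing one of d's incoming edges leaves a tree.
pruned = [("a", "b"), ("a", "c"), ("b", "d")]
```

Here `is_tree(diamond)` is false because of the shared child, while `is_tree(pruned)` is true.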

[0009] In the work of Tim Callahan & John Wawrzynek, these trees are matched with the FPGA structure by use of a "tree covering" program called iburg. Iburg is a generally available software tool, and its application is described in "A Retargetable C Compiler: Design and Implementation", Christopher W. Fraser and David R. Hanson, Benjamin/Cummings Publishing Co., Inc., Redwood City, 1995, especially at pp 373-407. Iburg takes as input the source code trees and partitions this input into chunks that correspond to instructions on the target processor. This partition is termed a tree cover. This approach is essentially determined by the user-defined patterns allowable for a chunk, and is relatively complex: it involves a bottom-up matching of a tree with patterns, recording all possible matches, followed by a top-down reduction pass to determine which match of patterns provides the lowest cost. Again, this approach requires a significant initial constraint in the form of the predefined set of allowable patterns, and does not fully realize the possibilities of a reconfigurable architecture.

[0010] There is thus a need to develop techniques and approaches to further improve the performance of systems involving a primary and secondary processor, by which an optimal choice can be made for allocation of code to a secondary processor, which can then be configured as efficiently as possible to run that code, so maximising the performance efficiency of the primary and secondary processor system in execution of input code.
[0011] Accordingly, the invention provides a method of compiling source code to a primary and a secondary processor, comprising: selective extraction of dataflows from the source code; transformation of the extracted dataflows into trees; matching of the trees against each other to determine minimum edit cost relationships for transformation of one tree into another; determining a group or a plurality of groups of dataflows on the basis of said minimum edit cost relationships and creating for each group a generic dataflow capable of supporting each dataflow in that group; using the generic dataflow or dataflows to determine the hardware configuration of the secondary processor; and substituting into the source code calls to the secondary processor for said group or plurality of groups of dataflows, and compiling the resultant source code to the primary processor.

[0012] This approach allows for optimal selection of source code dataflows for allocation to the secondary processor without prejudgement of suitability (by, for example, mapping onto predetermined templates) but while still taking full account of the demands and requirements of the secondary processor architecture. Advantageously, said minimum edit cost relationships are determined according to the architecture of the secondary processor, and represent a hardware cost of a corresponding reconfiguration of the secondary processor. The method is particularly effective if the minimum edit cost relationships are embodied in a taxonomy of minimum edit distances for classification of the trees.
[0013] The method finds its most useful application where the hardware configuration of the secondary processor allows for reconfiguration of the secondary processor during execution of the source code, as the secondary processor can then be reconfigured during execution, as required, to support each dataflow in the group supported by a generic dataflow. The secondary processor may thus be an application specific instruction processor, and the processor hardware may be a field programmable gate array or a field programmable arithmetic array (such as that shown in the CHESS architecture discussed in Appendix A).

[0014] Advantageously, the generic dataflow of a group is calculated by an approximate mapping of dataflows in the group on to each other, followed by a merge operation.

[0015] An advantageous approach to construction of a generic dataflow is to provide the dataflows as directed acyclic graphs and reduce them to trees by removal of any links in the directed acyclic graphs not present in a critical path between a leaf node and the root of a directed acyclic graph, wherein a critical path is a path between two nodes which passes through the largest number of intermediate nodes. Alternative criteria to the critical path can be adopted if more appropriate to the secondary processor hardware (for example, if a different criterion can be found which is more sensitive to the timing of operations in the secondary processor).

[0016] An advantageous further step can be taken after the creation of a generic dataflow, in which the generic dataflow is compared with further dataflows extracted from the source code, wherein those of said further dataflows which match sufficiently closely the generic dataflow are added to the generic dataflow. This enables more or all of the code present in the source code which is suitable for allocation to the secondary processor to be so allocated.

[0017] In the approaches indicated above, the removed links are stored after the directed acyclic graphs are reduced to trees and are reinserted into the generic dataflow after the merging of the trees of the group into the generic dataflow.

[0018] Specific embodiments of the invention are described below, by way of example, with reference to the accompanying drawings, of which:

Figure 1 shows a general purpose computer architecture to which embodiments of the invention can suitably be applied;

Figure 2 shows schematically a method of compiling source code to a primary and a secondary processor according to an embodiment of the invention;

Figure 3 illustrates a step of conversion of a DAG to a tree employed in a method step according to one embodiment of the invention;

Figure 4a illustrates the step of insertion and deletion of nodes and Figure 4b illustrates the step of substitution of nodes in a tree matching process employed in a method step according to an embodiment of the invention;

Figure 5 shows an edit distance taxonomy provided in an example according to an embodiment of the invention;

Figure 6 illustrates a generic dataflow provided in an example according to one embodiment of the invention;

Figure 7 shows a logical interface for allocation of secondary processor resources for a generic dataflow according to an embodiment of the invention;

Figure 8 shows the application of DAGs to dataflows including multiplexers to handle conditional statements; and

Figures 9a to 9d show an illustration of the merging of candidate dataflows to form a generic dataflow according to an embodiment of the invention.

[0019] The present invention is adapted for compilation of source code to an architecture comprising a primary and a secondary processor. An example of such an architecture is shown in Figure 1. The primary processor 1 is a conventional general-purpose processor, such as a Pentium II processor of a personal computer. Receiving calls from the primary processor 1 and returning responses to it are secondary processors 2 and (optionally) 4. Each secondary processor 2, 4 is adapted to increase the computational power and efficiency of the architecture by handling parts of the source code not well handled by the primary processor 1. Secondary processor 4, optionally present here, is a dedicated coprocessor adapted to handle a specific function (such as JPEG, DSP or the like) - the structure of this coprocessor 4 will be determined by a manufacturer to handle a specific frequently used function. Such coprocessors 4 are not the specific subject of the present application. By contrast, the secondary processor 2 is not already optimised for a specific function, but is instead configurable to enable improved handling of parts of the source code not well handled by the primary processor. The secondary processor 2 is advantageously an application specific structure: it can be a conventional FPGA, such as the Xilinx 4013 or any other member of the Xilinx 4000 series. An alternative class of reconfigurable device, referred to as a field programmable arithmetic array, is described in Appendix A hereto. Such a secondary processor can be configured for high computational efficiency in handling desired parts of the source code for an application to be executed by the architecture.

[0020] Also employed in the computer architecture are memory 3, accessed by the primary processor 1 and, for appropriate types of secondary processor 2, by the secondary processor 2, and input/output channel 5. Input/output channel 5 here represents all further channels and hardware necessary to enable the user to interact with the processors (for example, by programming) and to allow the processors to interact with all other parts of the computer device 6.

[0021] The present invention is particularly relevant to the optimised partitioning of source code between primary processor 1 and secondary processor 2, which allows for optimal configuration of secondary processor 2 to optimise the handling of the application embodied in the source code by the architecture. A significant contribution is made by the invention in the selection and extraction of code for use in the secondary processor.

[0022] The approach taken, according to an embodiment of the invention, is set out in Figure 2. The initial input to the process is a body of source code. In principle, this can be in any language: the example described was carried out on C code, but the person skilled in the art will readily understand how the techniques described could be adopted with other languages. For example, the source code could be Java byte code: if Java byte code could be so handled, the architecture of Figure 1 could be particularly well adapted to directly receiving and executing source code received from the internet.

[0023] As can be seen from Figure 2, the first step in the process is the identification of appropriate candidate code to be executed by the secondary processor 2. Typically this is done by performing dataflow analysis on the source code and building appropriate representations of the dataflows presented by selected lines of code (in most processes this is normally preceded by a manual profiling of the code). This is a standard technique in compiling generally, and application to secondary processors is discussed in, for example, Athanas et al, "An Adaptive Hardware Machine Architecture and Compiler for Dynamic Processor Reconfiguration", IEEE International Conference on Computer Design, 1991, pages 397-400.

[0024] The approach taken here is to build directed acyclic graphs (DAGs) which represent the dataflows of selected code. An advantageous way to do this is by using a compiler infrastructure appropriately configured for the extraction of dataflows: an appropriate compiler infrastructure is SUIF, developed by the University of Stanford and documented extensively at the World Wide Web site http://suif.stanford.edu/ and elsewhere. SUIF is devised for compiler research for high-performance systems, specifically including systems comprising more than one processor. A standard SUIF [...]; it is then a simple process for one skilled in the art to use SUIF tools to build the DAGs by performing a dataflow analysis over sections of SUIF and then recording the results of the analysis.

[0025] The construction of DAGs from source code is a conventional step. The next step, as can be seen from Figure 2, is the conversion of these DAGs into trees. This step is a significant factor in making the optimal choice of code for execution by the secondary processor 2. DAGs are complex structures, and difficult to analyse in an effective manner. Reduction of DAGs to trees allows the aspects of the dataflows most important in determining their mapping to hardware to be retained, while simplifying the structure sufficiently to allow analytical approaches to be made significantly more effective.

[0026] Discussion of the reduction of DAGs to trees is made in "High Performance Compilers for Parallel Computing" (as cited above), especially at pages 56 to 60. Different terminology is used here from that used in the cited reference, but equivalent and comparable terms are indicated below. The type of trees constructed here are directly comparable to the "spanning trees" referred to in the cited reference.

[0027] The preferred approach followed in the reduction of DAGs to trees is the removal of links not in the critical path between leaf nodes and the root: this is illustrated in Figure 3. The critical path between nodes A and B is, in a first embodiment of this reduction process, defined as the one that touches the maximum number of nodes. As a DAG is, by definition, acyclic, distinct paths can be defined to meet this criterion. It is possible for there to be different paths between nodes that have the same maximum number of nodes, but these [...] purpose of tree construction. While making an arbitrary selection between these paths is a valid approach, a key issue in mapping the source code successfully is scheduling, which depends on timing information: accordingly, where it is necessary to make a choice between alternative "critical paths" it is desirable to choose the one that would take the longest time (in terms of time taken to execute each of the operations represented by the nodes in the path). As is discussed further below, alternative approaches can be adopted which are based more directly on timing information. It is also desirable to adopt a consistent approach in making such choices - otherwise morphologically different trees can result from essentially similar DAGs.

[0028] The process taken in applying this first embodiment of the critical path criterion is as follows. Starting from each leaf node, every possible path towards the root is chased: as the DAG is a directed graph, this is straightforward. As indicated above, for each leaf node the path with the greatest number of nodes is chosen, and if two paths are found to have the same number of nodes, a selection is made. This is the critical path for that leaf node. All other paths not selected are cut in their edge closest to the starting point. This cut edge is termed a minor link (equivalent to the term "cross-link" in the Wolfe reference). The tree consists of the assembly of critical paths, and contains no minor links. The minor links are stored separately. Minor links will be required when extracted source code is mapped to secondary processor 2, but are not used in determining which source code is to be mapped.

[0029] It is of course possible to construct trees from DAGs without using the critical path criterion. Use of the critical path does provide particular advantages. In particular, removal as minor links of the cross-links not in the critical path will have little effect on scheduling, whereas if another approach was adopted removed cross-links may have a considerable influence on timing and hence on scheduling. Use of the critical path criterion allows construction of a tree which represents as well as possible the critical features of the DAG in the context of mapping to hardware.
[0030] Figure 3 shows the application of the process described in the preceding paragraph. Source code extract 11 shows three lines under consideration for execution by secondary processor 2. DAG 12 shows these three lines of code represented as a directed acyclic graph, with root 126 (variable e) and leaf nodes 121, 129 and 130 as the inputs.

[0031] It is now a straightforward matter to assess each path from a given leaf node to the root, and to compare the number of nodes in each path. From node 129 (integer value 2), there is only one path, through nodes 122, 123, 124 and 125. This is then the critical path from leaf node 129 to root node 126, and will be present in the tree. From node 121 (in the present case the result of an earlier operation and designated c), there are two paths. The first path passes through nodes 122, 123, 124 and 125, whereas the second path passes through nodes 127, 128 and 125. The first path is the critical path, as it passes through more nodes: the second path can thus be cut, as is discussed below. The remaining leaf node 130 (variable b) also has two paths available: one passes through nodes 123, 124 and 125, whereas the other passes through nodes 127, 128 and 125. These are equivalent in terms of number of nodes and so either path can be chosen as the critical path: however, for reasons discussed above (timing and morphological consistency) it is desirable to operate under an appropriate set of further rules to make the best selection. Such further rules may, for example, be determined on the basis of the relevant hardware. Here, the second path is chosen.

[0032] The next step to take is to construct a tree 14 from the critical paths chosen from the DAG 12. This is done by cutting all non-critical paths in their edge closest to the starting point (that is, the edge closest to the starting point which is not also part of a critical path). The first non-critical path to consider is that from node 121 to root 126 through nodes 127, 128 and 125: this can be cut on the edge between nodes 121 and 127 - in the tree, this is represented by removal of edge 151 between nodes 141 (corresponding to 121) and 147 (corresponding to 127), which is stored separately as a minor link. The other non-critical path to consider is that from node 130 to root 126 through nodes 123, 124 and 125: this can be cut on the edge between nodes 130 and 123. Again, this cut edge is stored as a minor link.
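
The reduction just described can be sketched as follows. This is a minimal illustration rather than the implementation used in the embodiment: the function names, the dictionary representation of the DAG and the tie-breaking rule (comparing node sequences) are assumptions made here, and the connectivity of DAG 12 is inferred only from the paths listed in the text.

```python
def all_paths(succ, node, root):
    """Enumerate every path from node to root, following dataflow edges."""
    if node == root:
        return [[root]]
    return [[node] + rest
            for nxt in succ.get(node, [])
            for rest in all_paths(succ, nxt, root)]

def reduce_dag_to_tree(succ, leaves, root):
    """Return (tree_edges, minor_links, critical_paths) for a DAG.

    succ maps each node to the nodes that consume its result.
    """
    # Longest path per leaf; ties are broken by comparing the node sequences,
    # an arbitrary but consistent rule assumed for this sketch.
    critical = {
        leaf: max(all_paths(succ, leaf, root), key=lambda p: (len(p), p))
        for leaf in leaves
    }
    critical_edges = {e for p in critical.values() for e in zip(p, p[1:])}

    minor_links = set()
    for leaf in leaves:
        for path in all_paths(succ, leaf, root):
            if path == critical[leaf]:
                continue
            # Cut the edge nearest the leaf that is not on any critical path.
            for edge in zip(path, path[1:]):
                if edge not in critical_edges:
                    minor_links.add(edge)
                    break

    all_edges = {(a, b) for a, outs in succ.items() for b in outs}
    return all_edges - minor_links, minor_links, critical

# Connectivity of DAG 12 as read from the paths listed in the text.
FIGURE_3 = {
    129: [122], 121: [122, 127], 130: [123, 127],
    122: [123], 123: [124], 124: [125],
    127: [128], 128: [125], 125: [126],
}
tree_edges, minor_links, critical = reduce_dag_to_tree(
    FIGURE_3, [121, 129, 130], 126
)
```

With this input the two minor links are the edges 121 to 127 and 130 to 123, as in the text. The tie-break used here happens to pick the path through 127, 128 and 125 for leaf 130; a hardware-derived rule, as the description suggests, could replace it.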
[0033] It should be noted that conditionals can be represented in DAGs and so reduced to trees in exactly the same way as simple equations. An example is shown in Figure 8: this is a DAG representing the dataflow of the lines

    if (x < 2)
        a = b
    else

and shows a multiplexer node 186 and a "less than" operation node 186 in addition to the variable and integer nodes 181, 182, 183 and 184. As the skilled man will appreciate, it will generally be possible to use the approach shown here for source code which can be represented as a DAG.
[0034] The tree structure that is left - in this case, tree 14 - is a much easier structure to use in determining which source code should be mapped to secondary processor 2, as is discussed further below. The technique described above is a particularly appropriate one for converting DAGs to trees, as it is straightforward to implement, is general in application, and through use of the critical path maintains the maximum "depth" of the computational engine to be synthesised (assuming each node represents a single computational element) because of the inclusion of paths with the maximum number of nodes. As the person skilled in the art will appreciate, alternative approaches to determining which edges are to be removed in converting the DAGs into trees can be adopted. One alternative embodiment of the DAG to tree reduction process is to assign a timing-based weight to every node (based, for example, on the length of time required to execute the corresponding computational element) and then to compare the accumulated weights of each path, selecting a path to define the tree accordingly on the basis of, for example, greatest accumulated weight. This approach may be more appropriate if the timing parameters of the secondary processor 2 will be a critical practical factor, and in particular if the timing dependencies are not mainly related to the node count (which may be the case in structures where, for example, multiplication is several times more time consuming than addition).
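
The difference between the node-count criterion and a timing-weighted criterion can be shown with a small sketch. The function name `critical_path`, the node names and the timing figures are all hypothetical and chosen only to make the two criteria disagree.

```python
def critical_path(paths, weight=None):
    """Pick the critical path among candidate paths.

    With weight=None the node-count criterion is used; otherwise the path
    with the greatest accumulated node weight is selected.
    """
    if weight is None:
        return max(paths, key=len)
    return max(paths, key=lambda path: sum(weight[node] for node in path))

# Hypothetical candidate paths, written as the operations they pass through.
paths = [["add1", "add2", "add3"], ["mul1", "add4"]]
# Hypothetical timings: a multiplication takes four times as long as an addition.
timing = {"add1": 1, "add2": 1, "add3": 1, "add4": 1, "mul1": 4}

by_count = critical_path(paths)           # three nodes beat two
by_timing = critical_path(paths, timing)  # accumulated weight 5 beats 3
```

Under the node-count criterion the three-addition path is critical, whereas the timing weights select the shorter path containing the multiplication.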
[0035] The next step in the compilation process, as can be seen from Figure 2, takes trees as inputs and determines the selection of source code for the secondary processor 2. As is further illustrated in Figure 2, this step of the process comprises a series of sub-steps. The first of these is the analysis and classification of the trees resulting from the candidate dataflows. This is a significant original step, and is discussed in detail below.

[0036] The objective in this stage of the compilation process is to determine as well as possible which of the candidate dataflows from the source code would be the best choices for execution by the secondary processor. This is to a large degree dependent on the nature of the hardware in the secondary processor. An extremely efficient mapping of source code to the secondary processor 2 can be made where dataflows are sufficiently similar that broadly the same hardware representation can be used for each dataflow. It therefore follows that good choices of candidate dataflows for mapping to the secondary processor can be made by finding sets of dataflows that are sufficiently similar to each other. This is what is achieved by analysing and classifying the trees resulting from the candidate dataflows.
[0037] A powerful technique for matching trees, used in this embodiment of the invention, is the tree matching algorithm devised by Kaizhong Zhang of the University of Western Ontario, Canada.

[0038] This algorithm is described in Kaizhong Zhang, "A Constrained Edit Distance Between Unordered Labeled Trees", Algorithmica (1996) 15:205-222, Springer-Verlag, and is provided as a toolkit by the University of Western Ontario, the toolkit being at the time of writing obtainable over the Internet from ftp://ftp.csd.uwo.ca/pub/kzhang/TREEtool.tar.gz. It will be appreciated that alternative approaches of matching trees to determine a degree of similarity therebetween are available to the skilled man. The approach to tree matching used in this embodiment of the invention is described below.
[0039] The principle of operation of Zhang's algorithm is the following: two trees are compared node-by-node through a dynamic programming technique that minimises the edit operations required to transform one tree into another. This cost of transformation is termed here an edit cost. The edit costs of successively larger subtrees are cross-compared, with a record being kept of the minimum costs found. The computational structure can be characterised as that of a recursive dynamic program which uses a working dynamic programming grid to calculate component subtree distances and records the result on the main grid.
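
The flavour of such a recursive dynamic program can be sketched in Python. Note that this is not Zhang's constrained algorithm for unordered trees: it is the simpler, classic edit distance recurrence for ordered trees, swapped in here because it is compact, with memoisation standing in for the dynamic programming grids. The tree representation, the helper `tree` and the unit costs are assumptions made for illustration.

```python
from functools import lru_cache

def tree(label, *children):
    """Build a tree as (label, children), with children a tuple of trees."""
    return (label, tuple(children))

def edit_distance(t1, t2, insert=1, delete=1, substitute=1):
    """Minimum cost of turning ordered tree t1 into ordered tree t2."""

    @lru_cache(maxsize=None)
    def dist(f, g):
        # f and g are forests (tuples of trees); compare rightmost roots.
        if not f and not g:
            return 0
        if not f:
            _, w_kids = g[-1]
            return dist(f, g[:-1] + w_kids) + insert
        if not g:
            _, v_kids = f[-1]
            return dist(f[:-1] + v_kids, g) + delete
        (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
        relabel = 0 if v_label == w_label else substitute
        return min(
            dist(f[:-1] + v_kids, g) + delete,    # delete root v
            dist(f, g[:-1] + w_kids) + insert,    # insert root w
            dist(v_kids, w_kids) + dist(f[:-1], g[:-1]) + relabel,
        )

    return dist((t1,), (t2,))

# Hypothetical dataflow trees.
t_add = tree("add", tree("x"), tree("y"))
t_sub = tree("sub", tree("x"), tree("y"))                  # one substitution away
t_deep = tree("add", tree("shift", tree("x")), tree("y"))  # one insertion away
```

With unit costs, `t_add` is at distance 1 from both `t_sub` and `t_deep`. Passing different `insert`, `delete` and `substitute` values is one simple way to reflect costs that depend on the target hardware.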

[0040] The edit operations available are insertion, deletion and substitution. These are shown in Figures 4a and 4b. In Figure 4a, a new node is inserted between nodes 2 and 5 of tree 151: this new node gives the structure of tree 152. Consequently transformation of tree 151 to tree 152 is achieved by insertion of this node, and transformation of tree 152 to tree 151 is achieved by deletion of it. On the CHESS architecture described in Appendix A, "deletion" is represented in hardware by "bypass" of a unit of the array: this is an example of an architecturally designed cost - in this case a [...]. In Figure 4b, trees 151 and 152 have the same structure, but node 4 represents a different type of operation in each tree: it is therefore necessary to substitute for node 4 in transforming one tree to the other. Every node therefore needs a "label": a tag attached to the node which identifies the type of node among the various types of node possible.

[0041] As previously indicated, each of these edit operations has an associated edit cost. For example, the same result may be achieved in some architectures either by an insertion and a deletion, or by a substitution: the costs of these different alternatives can be compared.

[0042] The result of the comparison of two trees by this algorithm is the production of a list of pairs of nodes (t1, t2), where t1 belongs to the first tree and t2 belongs to the second tree. Each pairing constitutes an identification of similar points in the two trees, suggesting the mapping of t1 and t2 on to each other. The list of pairs effectively defines the skeleton of a tree which can contain either of the compared trees: in this skeleton, to transform the first tree into the second tree, each node t1 has to be substituted with the respective t2. Nodes that do not occur in the mapping must be either inserted or deleted depending on which tree they belong to, as is discussed further below. For this list of pairs there will be defined an edit distance: this is the minimum of the edit costs cumulated over the pairs necessary to transform one tree to the other. The algorithm is devised to determine an edit distance between two trees, together with the set of transformations which achieves that edit distance: alternative transformations will be possible, but they will have a higher associated cumulative edit cost.

[0043] The value of computing an edit distance based on edit costs is that the edit costs may be chosen to represent
the "hardware cost" in reconfiguring the secondary processor from the configuration representing one tree to a
configuration representing the other tree in a mapping. This "hardware cost" is typically a measure of the quantity of
secondary processor resources that will be taken up to achieve the second configuration given the existence of the first -
this can be considered, for example, in terms of the additional area of device used. These costs will be determined by the
nature of the secondary processor hardware, as for different types of hardware the physical realisation of insertion,
deletion and substitution operations will be different. For the reconfigurable CHESS array discussed in Appendix A, a
"bypass" operation involves minimal cost, a substitution between adds and subs (addition and subtraction
operations) has low cost, whereas substitution between muls and divs (multiplication and division operations) is expensive.
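
The relationship between architecture-dependent edit costs and the cumulative cost of a mapping can be sketched as follows. This is an illustrative Python sketch only: the numeric costs, the names `BYPASS_COST`, `SUBSTITUTION_COST`, `substitution_cost` and `mapping_cost`, and the dictionary-based representation of trees are assumptions made for the example and are not taken from the CHESS architecture or from Appendix B. The sketch evaluates the cost of a given list of pairs; it does not search for the minimum-cost mapping.

```python
# Hypothetical edit costs for a CHESS-like array (illustrative numbers only).
BYPASS_COST = 1            # insertion/deletion realised as a cheap bypass
DEFAULT_SUBSTITUTION = 10  # assumed cost for pairs not listed below
SUBSTITUTION_COST = {
    frozenset(("add", "sub")): 2,   # related operations: low cost
    frozenset(("mul", "div")): 50,  # multi-ALU operations: expensive
}


def substitution_cost(label_a, label_b):
    """Cost of turning a node labelled label_a into one labelled label_b."""
    if label_a == label_b:
        return 0
    return SUBSTITUTION_COST.get(frozenset((label_a, label_b)), DEFAULT_SUBSTITUTION)


def mapping_cost(labels1, labels2, pairs):
    """Cumulative edit cost of a given mapping between two trees.

    labels1, labels2: dicts from node id to operation label, one per tree.
    pairs: list of (t1, t2) node-id pairs produced by the tree comparison.
    """
    cost = sum(substitution_cost(labels1[t1], labels2[t2]) for t1, t2 in pairs)
    mapped1 = {t1 for t1, _ in pairs}
    mapped2 = {t2 for _, t2 in pairs}
    # Unmapped nodes of the first tree are deleted, those of the second inserted.
    cost += BYPASS_COST * (len(labels1) - len(mapped1))
    cost += BYPASS_COST * (len(labels2) - len(mapped2))
    return cost
```

Changing only the cost table retargets the same comparison to a different secondary processor, which is the point made in the paragraph above.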

[0044] As indicated above, an edit distance between two trees can be constructed. However, a further step can be
taken: using Zhang's algorithm, or a comparable approach, a taxonomy can be built to show the edit distances between
each one of a set of trees. This taxonomy can readily be provided in the form of a tree, of which an example is shown
in Figure 5. Each leaf node 161 of the tree represents a candidate tree extracted from a DAG, and each intermediate
node 162 represents an edit cost. The tree provides a unique path between each pair of leaf nodes. The edit distance
between the two leaf nodes of a pair is found by summation of costs provided at each intermediate node on this path. For
example, the edit distance between any pair of the leaf nodes representing Tree#4, Tree#5 or Tree#6 is 6. However, the
edit distance between Tree#1 and Tree#4 is 496: the summation of intermediate nodes with values of 12, 221, 107, 50
and 6.
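
The path summation described above can be sketched in Python as follows. The function name `taxonomy_distance`, the parent-pointer representation and the small taxonomy used in the usage note are assumptions for illustration; they do not reproduce the actual topology of Figure 5. The sketch assumes two distinct leaves of a single taxonomy tree.

```python
def taxonomy_distance(parent, cost, leaf_a, leaf_b):
    """Sum the costs of the intermediate nodes on the path between two leaves.

    parent: dict mapping each node to its parent (the root maps to None).
    cost:   dict mapping each intermediate node to its edit cost.
    """
    # Collect the ancestors of leaf_a, nearest first.
    ancestors_a = []
    node = parent[leaf_a]
    while node is not None:
        ancestors_a.append(node)
        node = parent[node]
    # Climb from leaf_b until the path meets an ancestor of leaf_a.
    total = 0
    node = parent[leaf_b]
    while node not in ancestors_a:
        total += cost[node]
        node = parent[node]
    # 'node' is now the lowest common ancestor: add it, plus the
    # intermediate nodes between it and leaf_a.
    index = ancestors_a.index(node)
    return total + sum(cost[n] for n in ancestors_a[: index + 1])
```

For a hypothetical taxonomy in which intermediate node `n2` (cost 6) is the parent of leaves `T4` and `T5`, and the root `n1` (cost 12) is the parent of `n2` and of leaf `T1`, the distance between `T4` and `T5` is 6 and the distance between `T1` and `T4` is 12 + 6 = 18.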

[0045] This taxonomy is indicative of the number of edit operations required to translate between trees. Such a
taxonomy is a valuable tool, as it can be used heuristically as a metric for the degree of variation between candidate trees.
The creation of a taxonomy thus renders it easy to determine which trees are sufficiently similar to be consolidated
together (as will be discussed below), and which are too diverse for this purpose. This can be done by imposition of an
edit distance threshold. A group of trees can be selected for consolidation if the edit distance between each and every
possible pair of trees in the group is less than the edit distance threshold. The value of the edit distance threshold is
arbitrary, and can be chosen by the person skilled in the art in the context of specific primary and secondary processors
in order to optimise the performance of the system.
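
The threshold test described above can be sketched as follows. The greedy grouping strategy and the name `group_by_threshold` are assumptions made for this example: the text requires only that every pair of trees within a group lies below the threshold, and does not prescribe how groups are formed.

```python
def group_by_threshold(trees, distance, threshold):
    """Greedily partition trees into groups whose pairwise distances
    are all below the threshold.

    distance: dict mapping frozenset({a, b}) to the edit distance between a and b.
    """
    groups = []
    for tree in trees:
        for group in groups:
            # A tree joins a group only if it is close to every member.
            if all(distance[frozenset((tree, other))] < threshold for other in group):
                group.append(tree)
                break
        else:
            groups.append([tree])
    return groups
```

With the distances quoted for Figure 5 (6 between any two of Tree#4, Tree#5 and Tree#6, and 496 between Tree#1 and Tree#4), and assuming for the example that Tree#1 is equally distant from Tree#5 and Tree#6, a threshold of 10 places Tree#4, Tree#5 and Tree#6 in one group and leaves Tree#1 on its own.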

[0046] The advantage of consolidating a group of trees is that a common hardware configuration can be used for the
whole group and will support the function of each tree. This is particularly appropriate for architectures, such as
CHESS, in which low-latency partial reconfiguration mechanisms are available on the secondary processor.
Reconfiguration is required to change the configuration from that to support the function of one tree to that to support the
function of another tree: however, as the edit distance between these trees will never be greater than the edit cost
threshold, the degree of reconfiguration required is already known to be within acceptable bounds. The group of trees are
consolidated together by construction of a "supertree" which contains a representation of every component tree. After it has
been constructed, the supertree can be converted into a representation of each of the relevant DAGs extracted from the
source code by reinsertion of the previously removed minor links. The hardware configuration may then be determined
from the full supertree. The construction of the supertree is discussed in detail below.

[0047] Figure 6 illustrates the step of construction of a supertree from a group of trees which fall below the specified
edit cost threshold: such a group of trees is here termed a class. The trees 171, 172 and 173 can all be mapped
together into supertree 170. The reconfiguration required to change the hardware configuration from that to support, for
example, tree 171 to that of tree 172 is sufficiently limited to be realizable in practice, because the edit distance between
the two trees is below the edit cost threshold.

[0048] An exemplary supertree assembly algorithm, merge, is provided as C code in Appendix B. The function of the
algorithm is described below, with reference to Figure 9. The algorithm contains the following elements:

merge:

[0049] The tree in the class with the largest number of nodes is chosen to be the initial merge tree - if there are trees
with an equal number of nodes, an arbitrary selection can be made. The remaining trees are termed source trees.

[0050] For each source tree the following operations are then applied:

From the mapping between the source tree and the merge tree which has been calculated (in this embodiment,
from Zhang's algorithm and edit costs determined from the secondary processor architecture), the supertree is
constructed as follows:

1. Firstly, mapped nodes closest to the root are considered.

2. The source tree operation (source operation) is concatenated to the corresponding mapped merge tree
operation (merge operation).

3. For each child operation of the source operation:

   a. If the child is mapped, revert to step 2 with respect to the source child.

   b. If the child is not mapped, then consider whether there is any mapping in the subtree of which the child
   is the root (source subtree):

      i. If there is no further mapping, simply adopt the source subtree for merging into the merge tree under
      the corresponding merge tree node.

      ii. If there is a further mapping inside the source subtree, connect the subtree as follows:

         a. If the merge operation of this subordinate mapping falls outside the previously mapped subtree,
         remove the mapped source operation from the source tree. There is recursion present at this
         stage - where mapped children have already been dealt with, all that needs to be done is to
         remove what would otherwise be a cross tree link.

         b. This is shown in Figure 9. If the merge operation of this subordinate mapping does fall within
         the previously mapped subtree, climb up the merge tree until the least common ancestor for all
         contained subordinate mappings is found. The least common ancestor is the first node to contain
         all of the source mappings. The unmapped source segment is then mapped into the merge tree
         by linking the source operation of the unmapped source subtree as a child of the least common
         ancestor's parent and by linking the least common ancestor as the child of the unmapped source
         operation just above the closest mapped source operation in the current subtree (where the
         "closest mapped source operation" delimits the lower end of an unmapped segment of the source tree,
         and is a mapped node which falls within the subtree of the current mapping - the source node's
         parent, which is unmapped, adopts the merge tree's least common ancestor as a child and vice
         versa).

The pair of intermingled trees are normalised into a single tree, which forms the new merge tree.

The procedure continues until all the source trees in the class are contained within the merge tree, which is now a
supertree.
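
The simpler parts of the procedure above (selection of the initial merge tree, and steps 2, 3a and 3b(i)) can be sketched in Python as follows. This is not the C code of Appendix B: the `Node` class and the names `has_mapping`, `merge` and `initial_merge_tree` are assumptions made for illustration, and step 3b(ii), which handles subordinate mappings, is deliberately left unimplemented.

```python
class Node:
    """A tree node carrying the list of operations concatenated at it."""

    def __init__(self, op, children=()):
        self.ops = [op]
        self.children = list(children)


def has_mapping(node, mapping):
    """True if node or any of its descendants appears in the mapping."""
    return node in mapping or any(has_mapping(c, mapping) for c in node.children)


def merge(source, merge_node, mapping):
    """Merge a mapped source node into its merge-tree partner.

    mapping: dict from source Node to the merge-tree Node it is mapped to.
    """
    merge_node.ops.extend(source.ops)             # step 2: concatenate
    for child in source.children:
        if child in mapping:                      # step 3a: recurse on mapped child
            merge(child, mapping[child], mapping)
        elif not has_mapping(child, mapping):     # step 3b(i): adopt whole subtree
            merge_node.children.append(child)
        else:                                     # step 3b(ii): not sketched here
            raise NotImplementedError("subordinate mappings are not handled")


def initial_merge_tree(trees):
    """Pick the tree with the most nodes as the initial merge tree."""
    def size(node):
        return 1 + sum(size(c) for c in node.children)
    return max(trees, key=size)
```

For example, merging a source tree consisting of root A with a single unmapped child B into a merge tree whose root A is mapped to the source root yields a merge root carrying both A operations, with B adopted as an additional child.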

[0051] This process is indicated in Figure 9. Figure 9a shows two dataflow trees, a merge tree 201 and a source tree
202. There are three mappings between nodes made by the comparison algorithm - the remaining nodes need
to be inserted appropriately. As indicated in section 1 above, the first step is to consider the mapped operations nearest
the root - in this case, at the root. These operations A are concatenated.

[0052] After this, the child nodes of A in the source tree are considered. Node B does not have a mapping and is not
an ancestor to any mappings - it is therefore merged as a child of A:A (see Figure 9b). The other child node of A, C,
does however have descendant mappings (D and F, which map to D and E in the merge tree). Both the relevant merge
operations fall in the previously mapped subtree (as they are both descendants of A). It is therefore necessary to follow
the course set out in section 3(b)(ii)(b) above. The least common ancestor containing both mapped merge operations
D and E is X. C of the source tree is thus linked into the merge tree as child of A:A (the parent of X) and parent of X.
This arrangement is shown in Figure 9b - the merging is completed by concatenation or merging of the remaining nodes
of the source tree, all of which steps are straightforward.
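
The least common ancestor step applied in this example can be sketched as follows. The parent-pointer dictionary, the function names `least_common_ancestor` and `splice_above`, and the assumption that D and E are direct children of X are illustrative only; Figure 9 itself is not reproduced here.

```python
def least_common_ancestor(parent, nodes):
    """First node, climbing from nodes[0], whose subtree contains every node given.

    parent: dict from node name to parent name (the root maps to None).
    """
    def ancestors(n):
        chain = []
        while n is not None:
            chain.append(n)
            n = parent[n]
        return chain

    common = ancestors(nodes[0])
    for n in nodes[1:]:
        seen = set(ancestors(n))
        common = [a for a in common if a in seen]
    return common[0]


def splice_above(parent, lca, new_node):
    """Link new_node as a child of lca's parent and as the parent of lca."""
    parent[new_node] = parent[lca]
    parent[lca] = new_node
```

With a merge tree in which X and B are children of A:A and D and E are children of X, the least common ancestor of D and E is X, and splicing C above X makes C a child of A:A and the parent of X, as in the text.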

[0053] The resultant supertree 203 is shown in Figure 9c. This supertree 203 acts as merge tree for the merging in
of a further candidate source tree 204, as shown in Figure 9d. In this case each node of the source tree is mapped into
a supertree node - merging is thus entirely straightforward, and consists only of concatenation (ie substitution). This
process continues until all the candidate trees are merged into a supertree.

[0054] At this stage, it is possible to take a step which enables more of the source code to be allocated to the
secondary processor. The source code will contain DAGs other than those which have been selected for inclusion in the
supertree: for example, DAGs which have not been considered because they do not lie at one of the most computationally
intensive "hot spots" of the code. However, the code of these DAGs may also run more quickly if executed on an
appropriately adapted secondary processor rather than on the primary processor. It can thus be advantageous to compare such
remaining DAGs with the supertree by a backmapping process. Processes derived from conventional backmapping
techniques, such as Iburg, can be utilised for this purpose. However, the most advantageous approach may be to return
to use of Zhang's algorithm, and match further candidate trees in the source code against the supertree, but this time
with a lower edit cost threshold. Where the trees derived from such DAGs can either be mapped directly onto the
supertree, or where the edit cost for such a mapping falls below some minimum level, then the code of these DAGs can also
be allocated to the secondary processor and the supertree modified, if necessary. Control information related to any
such dataflows added by this backmapping process needs to be stored also.

[0055] From this supertree, it is then straightforward to reinsert the minor links which were removed from the DAGs on
their conversion into trees (including here any DAGs added from the backmapping process, if employed). The resulting
structure is a class dataflow, which represents all the information present in the DAGs of the class: control information
for the supertree (for example, to determine any reconfiguration that is to occur) must also be present. This class
dataflow can be used for the purpose of determining the hardware configuration of the secondary processor and can also
be used to provide a structure for enabling stitching back into the source code appropriate calls to the secondary
processor: these steps are described further below.

[0056] Stitching calls to the secondary processor back into the source code in fact requires only the supertree and
not the class dataflow, as the supertree prescribes the periphery of the dataflow. The actions required with respect to
any replaced dataflow in the source code are replacement of inputs of the dataflow (leaves of the tree reduced from that
dataflow) with load primitives and of the output of the dataflow (root of the relevant tree) with a read. The leaves and
roots of the relevant tree are contained in the supertree, so only the supertree is required for the purpose. All remaining
code subsumed in the dataflow can simply be removed, as it is replaced by the secondary processor configuration.
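
The substitution described above can be sketched as a small code generator. The function name `stitch_calls`, the textual form of the `load` and `read` primitives and the resource names are hypothetical; the text specifies only that leaves become loads and the root becomes a read.

```python
def stitch_calls(leaves, root, io_map):
    """Emit replacement primitives for a dataflow subsumed by the secondary processor.

    leaves: list of source-level input names (leaves of the reduced tree).
    root:   source-level name receiving the result of the dataflow.
    io_map: dict from each leaf or root name to the I/O resource allocated to it.
    """
    # One load per input of the dataflow, in leaf order.
    lines = [f"load({io_map[leaf]}, {leaf});" for leaf in leaves]
    # A single read for the output of the dataflow.
    lines.append(f"{root} = read({io_map[root]});")
    return lines
```

For a dataflow with inputs `a` and `b` and output `y`, allocated to hypothetical resources `R0`, `R1` and `R2`, the sketch emits two loads followed by one read; all other code of the dataflow is omitted, as it is carried by the secondary processor configuration.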
[0057] Figure 7 shows a logical interface for achieving the necessary substitutions into the source code. An input tree,
labelled Input Tree #3, is shown, together with a supertree, labelled PFU Tree. Each node in Input Tree #3 has its own
unique operation ID obtained from the compiler internal form representation. For the supertree (PFU Tree) registers or
other I/O resources are allocated to the leaves and the root. The implicit mapping between Input Tree #3 and PFU Tree
thus provides a correspondence between operation IDs of the Input Tree nodes and the I/O resources allocated for PFU
Tree in the form of a specification. The application of this specification in the step indicated as "merge" in Figure 7 allows
removal of the code subsumed by the PFU and the substitution of the necessary I/O primitives in the code.
[0058] From the class dataflow, it is possible to configure the secondary processor. This step can be conducted
according to known approaches, by reduction of the class dataflow to a netlist (with insert, delete and substitute
operations, and including in appropriate form any dynamic reconfiguration instructions), and then mapping the netlist to the
specific secondary processor hardware, taking into account requirements of reconfiguration between component
dataflows. For conventional FPGA architectures, these steps can be carried out essentially by use of appropriate known
tools. For example, in the case of a standard Xilinx FPGA such as the XC4013, appropriate Xilinx proprietary tools
can be used. Firstly, the netlist can be rendered in Xilinx netlist format (XNF). This can then be followed by partitioning
into configurable logic blocks and input/output blocks by the Xilinx Partition Place and Route program (PPR), with the
resultant being converted to a configuration bitstream by the Xilinx MakeBits program. This approach is discussed
together with further discussion of provision of predetermined reconfiguration solutions, in "Run-Time Programming
Method for Reconfigurable Computer" by Steve Casselman, currently available on the World Wide Web at
http://www.reconfig.com/specrepV101596^ession1/libr^, a contribution to the World Wide Web roundtable
on reconfigurable computing operated by SB Associates, Inc. of 504 Nino Avenue, Los Gatos CA 95032 USA.
Similar procedures can be followed for alternative types of configurable and reconfigurable processor, such
as the CHESS device described in Appendix A, using tools appropriate to the processor concerned.
[0059] Once the source code is generated in executable form with appropriate calls to the secondary processor and
once the secondary processor configuration has been determined, the source code can be loaded and executed. The
source code is executed in the primary processor with calls to coprocessors and the secondary processor - as the
secondary processor is specifically adapted to process the dataflows extracted to it, the execution speed of the code is
significantly increased. For example, a 25% improvement was found in application of the method of this embodiment of the
invention to the iDCT algorithm from the JPEG toolkit, even though this is in fact a poor problem for mapping to such a
secondary processor because of I/O constraints.
[0060] The methods here described are thus particularly effective to allow for optimal use of the secondary processor
in an architecture comprising a primary processor and a reconfigurable secondary processor.

APPENDIX A
CHESS array

The CHESS array is a variety of field programmable array in which the programmable
elements are not gates, as in an FPGA, but 4-bit arithmetic logic units (ALUs). The array
configuration is described in detail in European Patent Application No. 97300563.0, and the
ALU structure and provision of instruction to ALUs is discussed in a copending application
entitled "Reconfigurable Processor Device" filed on the same date as the present
application.

The CHESS array consists of a chessboard layout with alternating squares comprising an ALU
and a switchbox structure respectively. The configuration memory for an adjacent switchbox
is held in the ALU. Individual ALUs may be used in a processing pipeline, and in a preferred
implementation, provision is made to allow dynamic provision of instructions from one ALU
to determine the function of a succeeding ALU. ALUs are 4-bit, with four identical bitslices,
with 4-bit inputs A and B taken directly from an extensive 4-bit interconnect wiring network,
and 4-bit output U provided to the wiring network through an optionally latchable output
register: 1-bit carry input and output are also provided and have their own interconnect.

Dynamic instructions are providable from the output U of one ALU to a 4-bit instruction input
I of another ALU. The carry output Cout of one ALU can also be used as Cin of another ALU
with the effect of changing the instruction of that ALU.

The CHESS ALU is adapted to support multiplexing between A and B inputs, and also
supports multiplexing between related instructions (eg OR/NOR, AND/NAND).
Reconfiguration between such instructions can be achieved through appropriate use of the
carry inputs and outputs without consumption of silicon. More complex reconfigurations (eg
AND/XOR, Add/Sub) can be achieved through using two ALUs, the first to multiplex between
the two alternative instructions and the second to execute the chosen instruction on the
operands. Multiplication will take up more than a single ALU, making reconfiguration
involving a multiplication operation expensive. It is straightforward using the multiplexing
capacity of a CHESS ALU to "bypass" an operation, with appropriate control resulting in
either performance of the operation or propagation of a given input.

A sample set of functions obtainable from the instruction inputs is indicated in Table A1
below: a wide range of possibilities are available with appropriate logic in connection of the
instruction inputs to the ALU. The functions are described in Table A2.

Table A1: Instruction bits and corresponding functions

U function

ADD
A plus B
Arithmetic carry

A minus B
Arithmetic carry

A AND B

if A == B then 0, else 1

Not applicable

bitwise OR of A and B, followed by an AND across the width of the word

Table A2: Outputs for instructions

complement arithmetic is used, and the arithmetic carry is provided to be consistent with
this arithmetic. The MATCH functions are so-called because for MATCH1 the value of 1 is

Claims

1. A method of compiling source code to a primary and a secondary processor, comprising:

selective extraction of dataflows from the source code;

transformation of the extracted dataflows into trees;

matching of the trees against each other to determine minimum edit cost relationships for transformation of one
tree into another;

determining a group or a plurality of groups of dataflows on the basis of said minimum edit cost relationships
and creating for each group a generic dataflow capable of supporting each dataflow in that group;

using the generic dataflow or dataflows to determine the hardware configuration of the secondary processor;
and

substituting into the source code calls to the secondary processor for said group or plurality of groups of
dataflows, and compiling the resultant source code to the primary processor.

2. A method as claimed in claim 1, wherein said minimum edit cost relationships are embodied in a taxonomy of
minimum edit distances for classification of the trees.

3. A method as claimed in claim 1 or claim 2, wherein said minimum edit cost relationships are determined according
to the architecture of the secondary processor, and represent a hardware cost of a corresponding reconfiguration
of the secondary processor.

4. A method as claimed in any of claims 1 to 3, wherein the hardware configuration of the secondary processor allows
for reconfiguration of the secondary processor during execution of the source code.

5. A method as claimed in claim 4, wherein the secondary processor is an application specific instruction processor.

6. A method as claimed in claim 4, wherein the secondary processor is a field programmable gate array.

7. A method as claimed in claim 4, wherein the secondary processor is a field programmable arithmetic array.

8. A method as claimed in any of claims 4 to 7, wherein reconfiguration of the secondary processor is required during
execution of the source code to support each dataflow in the group supported by a generic dataflow.

9. A method as claimed in any preceding claim, wherein a generic dataflow of a group is calculated by an approximate
mapping of dataflows in the group on to each other, followed by a merge operation.

10. A method as claimed in any preceding claim, wherein the dataflows are provided as directed acyclical graphs and
are reduced to trees by removal of any links in the directed acyclical graphs not present in a critical path between
a leaf node and the root of a directed acyclical graph.

11. A method as claimed in claim 10, wherein the critical path is a path between two nodes which passes through the
largest number of intermediate nodes.

12. A method as claimed in claim 10, wherein the critical path is a path between two nodes with the greatest
accumulated execution time.

13. A method as claimed in any of claims 10 to 12, wherein after the creation of a generic dataflow, the generic dataflow
is compared with further dataflows extracted from the source code and provided in the manner defined in claim 10,
wherein those of said further dataflows which match sufficiently closely the generic dataflow are added to the
generic dataflow.

14. A method as claimed in any of claims 10 or claim 13 where dependent on claim 9, wherein the removed links are
stored after the directed acyclical graphs are reduced to trees and are reinserted into the generic dataflow after the
merging of the trees of the group into the generic dataflow.

Figure 9a: Two Dataflow Trees
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